Homologous Pairs of Low and High Temperature Originating Proteins Spanning the Known Prokaryotic Universe

Stability of proteins at high temperature has been a topic of interest for many years, as this attribute is favourable for applications ranging from therapeutics to industrial chemical manufacturing. Our current understanding and methods for designing high-temperature stability into target proteins are inadequate. To drive innovation in this space, we have curated a large dataset, learn2thermDB, of protein-temperature examples, totalling 24 million instances, and paired proteins across temperatures based on homology, yielding 69 million protein pairs - orders of magnitude larger than the current largest. This important step of pairing allows for study of high-temperature stability in a sequence-dependent manner in the big data era. The data pipeline is parameterized and open, allowing it to be tuned by downstream users. We further show that the data contains signal for deep learning. This data offers a new doorway towards thermal stability design models.


S1. Pipeline steps and parameters
The pipeline for downloading raw data, to validation of protein pairs is runnable with a single command: 'dvc repro'.Alternatively you can run individual stages of the pipeline with 'dvc repro -s <name_of_stage>'.Many of the stages in the pipeline have parameters that affect the runtime behaviour and stage outputs, thus the pipeline is tunable by modifying 'params.yaml'.Each stage in the pipeline is outlined below, and the parameters are given in Table S1.

Parameters
The full list of parameters and their current values for this work is given in Table S1 below: Note: Start and end positions of the alignment on a strand is also tracked as query/subject_align_start/end.The difference between start and end, e.g., the length of the alignment per strand is tracked as query/subject_align_len.

S3. 16s rRNA Alignment
Taxa Pairs were determined via alignment of 16s rRNA strands from Refseq using BLASTn 2.14.0+.The parameters for the search are shown in Table S5  The best HSP as determined as the one with the most coverage of both strands (averaged) was then retained for each putative pair.Taxa were labelled as pairs if the resulting coverage for both strands was >=98.5% and the gap compressed percent identity >=81%.

Reported alignment
Protein Pairs were determined via alignment of UniProtKB sequence using DIAMOND version 2.1.6,a BLAST-like software.The parameters for the search are shown in Table S6 below, default if not listed: The best HSP as determined as the one with the most coverage of both strands (averaged) was then retained for the pair and is reported in the dataset with a number of alignment metrics (see Section S2).

Cost of alignment
Before running protein alignment on many taxa pairs and potentially wasting compute, we compared the use of BLASTp to DIAMOND using identical parameters wherever possible, eg.word size, matrix, penalties, etc.This was conducted for 10k random proteins vs 10k random proteins.The compute time, carbon, and return on investment in terms of number of hits is shown below if Figure S1.We chose to use DIAMOND for alignment given the small decrease in sensitivity compared to near order of magnitude cost reduction.

S5. Data Schema
Figure S2: Schema for presented database.The 'taxa' and 'proteins' tables contain raw data from UniProtKB, RefSeq, and Engqvist.The 'taxa_pairs' and 'pairs' tables contain information on alignment of 16s rRNA and protein sequences.See Table S7 below for a description of each field in the database.
Table S7: Data fields within the presented relational database.

S6. Pfam annotations
Pfam was used to annotate proteins in protein pairs.For a given protein pair, both proteins were run against 19,632 HMMs from Pfam version 35.0 using pyhmmer. 2 All annotations with sequence hit E-value < 1e -10 were retained.The set Jaccard score was then computed for annotations between two proteins, as defined below: /data/ogt_protein_classifier/model, ) The final model model was saved and evaluated against the test set.It can be found on the Huggingface Hub, here. 8

S8. ESM Mapping
To produce a 2D mapping of proteins (Figure 1B), we used ESM2 embeddings. 9First, embeddings from ESM Atlas were retrieved for all proteins with Tm>0.9 and pLDDT>0.9.These are the proteins that the ESM language model's structural prediction module is most confident in, eg.those that are not likely extrapolative given the training sequences/structures.The embeddings represent the mean over sequence tokens of the pretrained model's output latent space, producing a single size 2,560 vector for each protein sequence.
Due to computational cost, we sampled data in order to make dimensionality reduction achievable.

Figure S1 :
Figure S1: BLASTp vs DIAMOND resource test of protein alignment.DIAMOND significantly reduces cost while retaining most of the sensitivity.

Table S1 :
All parameters in the pipeline to modulate the final result.

Table S2 :
Parameters for BLASTp, if using it for protein alignment

Table S3 :
Parameters for DIAMOND, if using it for protein alignment * Note 2: To threshold by E value less than 1e -10 , we would have a nested specification like the following:

Table S4 :
Names and definitions of alignment metrics computed.

Table S5 :
below, default if not listed: BLASTn parameters used.

Table S8 :
Only 500k proteins from the ESM Atlas were retained by random sampling.We then randomly sampled proteins from pairs from ours and Hait et al.'s dataset, retaining the ratio of dataset sizes to each other and to the ESM Atlas.This resulted in 240k proteins from our dataset, and 82 from Hait et al.'s.We then used the same pretrained model version ('esm2_t36_3B_UR50D') as was used for ESM Atlas to embed our's and Hait et al.'s proteins.Finally, we ran multicore T-SNE on the embedded vectors down to two dimensions.[]We used the following parameters, the rest default: Parameters used for T-SNE dimensionality reduction of ESM embeddings.